ABSTRACT

Motivation: Metabolite identification from tandem mass spectrometric 
data is a key task in metabolomics. Various computational methods 
have been proposed for the identification of metabolites from tandem 
mass spectra. Fragmentation tree methods explore the space of pos- 
sible ways in which the metabolite can fragment, and base the me- 
tabolite identification on scoring of these fragmentation trees. Machine 
learning methods have been used to map mass spectra to molecular 
fingerprints; predicted fingerprints, in turn, can be used to score can- 
didate molecular structures. 

Results: Here, we combine fragmentation tree computations with 
kernel-based machine learning to predict molecular fingerprints and 
identify molecular structures. We introduce a family of kernels capturing 
the similarity of fragmentation trees, and combine these kernels using 
recently proposed multiple kernel learning approaches. Experiments on 
two large reference datasets show that the new methods significantly 
improve molecular fingerprint prediction accuracy. These improve- 
ments result in better metabolite identification, doubling the number 
of metabolites ranked at the top position of the candidates list. 
Contact: huibin.shen@aalto.fi 

Supplementary information: Supplementary data are available at 
Bioinformatics online. 



1 INTRODUCTION 

Metabolomics deals with the analysis of small molecules and 
their interactions in living cells. A central task in metabolomics 
experiments is the identification and quantification of the metab- 
olites present in a sample. This is mandatory for subsequent 
analysis steps such as metabolic pathway analysis and flux ana- 
lysis (Pitkanen et al., 2010). Mass spectrometry (MS) is one of 
the two predominant analytical technologies for metabolite iden- 
tification. Identification is done by fragmenting the metabolite, 
for example, by tandem MS (MS/MS), and measuring the mass- 
to-charge ratios of the resulting fragment ions. The measured 
mass spectra contain information about the metabolite, but ex- 
tracting the relevant information is a highly non-trivial task. 

Several computational methods have been suggested to identify
the metabolites from MS/MS spectra. Mass spectral databases
(spectral libraries) have been created (e.g. Hisayuki et al.,
2010; Oberacher et al., 2009; Smith et al., 2005; Tautenhahn
et al., 2012), against which measured mass spectra can be searched.
Unfortunately, this approach can only identify 'known unknowns'
for which a reference measurement is available.

Fragmentation trees are combinatorial models of the MS/MS
fragmentation process. Bocker and Rasche (2008) suggested
fragmentation trees for identifying the molecular formula of an
unknown compound. Later, fragmentation trees were shown to
contain valuable structural information about the compound
(Rasche et al., 2011, 2012).

The relation between spectral and structural similarities has
been studied by Demuth et al. (2004). A kernel-based machine
learning approach for metabolite identification was recently
introduced by Heinonen et al. (2012), relying on the prediction of
molecular fingerprints as an intermediate step. Molecular fingerprints
are bit vectors in which each bit indicates the presence of a certain
molecular property, such as a substructure of the molecule. The
predicted fingerprints are then matched against a chemical database
using a scoring function, which yields a ranked list of candidate
structures (Heinonen et al., 2012; Shen et al., 2013).

Besides these two approaches, methods have been suggested
for predicting MS/MS spectra from molecular structures (Allen
et al., 2013; Kangas et al., 2012); commercial software packages
also exist for this task. Such simulated spectra can be used to
replace the notoriously incomplete spectral libraries by molecular
structure databases (Hill et al., 2008). Combinatorial fragmentation
of molecular structures serves the same purpose (Gerlich and
Neumann, 2013; Wolf et al., 2010). Finally, we can search spectral
libraries for similar compounds, by comparing either MS/MS
spectra (Demuth et al., 2004; Gerlich and Neumann, 2013)
or fragmentation trees (Rasche et al., 2012). See Scheubert et al.
(2013) and Hufsky et al. (2014) for recent reviews.

We propose a joint strategy that combines fragmentation trees 
and multiple kernel learning (MKL) to improve molecular fin- 
gerprint prediction and, subsequently, the metabolite identifica- 
tion. We first outline the metabolite identification framework 
and introduce fragmentation trees and their computation. 
Next, we introduce a family of kernels for fragmentation trees, 
consisting of simple node and edge statistics kernels as well as 
path and subtree kernels that use dynamic programming (DP) 
for efficient computation. We then describe state-of-the-art
methods for MKL. In the experiments, we evaluate different
MKL algorithms with regard to fingerprint prediction and
metabolite identification.



2 METHODS 

Figure 1 gives an overview of our metabolite identification framework
through MKL. Fragmentation trees are computed first, followed by the
computation of kernels. MKL approaches are used to integrate different
kernels for molecular fingerprint prediction. The final step of the
framework is to query molecular structure databases with the predicted
molecular fingerprint using a probabilistic scoring function.

Under the parsimony assumption, we then compute a colorful subtree
of maximum weight in the fragmentation graph (see Section 2.1).
Unfortunately, finding this tree is an NP-hard problem (Rauf et al.,
2012). Nevertheless, we can compute optimal trees in a matter of
seconds using Integer Linear Programming (Rauf et al., 2012). For each
peak, this tree implicitly decides whether it is noise or signal and,
in the latter case, assigns the molecular formula of the corresponding
fragment and the fragmentation reaction it resulted from. The score of
the tree is the sum of its edge weights. Candidate molecular formulas
of the parent peak are ranked by this score, which is the maximum
score of any tree that has this molecular formula as its root.

Different from Bocker and Rasche (2008) and Rasche et al. (2011), we
used a modified weighting function for the edges of the fragmentation
graph. With these new weights, the above optimization can be interpreted
as a maximum a posteriori estimator of the observed data. We weight
edges by the logarithmic likelihood that a certain fragmentation reaction
occurs: for this, we consider the intensity and mass deviation of the
product fragment peak, the loss mass and chemical properties of the
molecular formula as proposed in Kind and Fiehn (2007), namely, the ring
double bond equivalent and the ratio of hetero atoms to carbon atoms.
Furthermore, we favor a few common losses that were learned from
the data, and penalize implausible losses and radicals. Such weights
have already been used in Bocker and Rasche (2008) and Rasche et al.
(2011); different from there, we did not choose parameters ad hoc but
rather learned them from the data. Details about these new weights will
be published elsewhere.



Fig. 1. The metabolite identification framework through MKL. First, we 
construct the fragmentation tree from the MS/MS spectrum. Second, we 
compute kernels for both MS/MS data and fragmentation trees. Third, 
MKL is used to combine kernels and predict molecular fingerprints. 
Finally, fingerprints are used for molecular structure database retrieval 



The advantages of the kernel-based machine learning framework are
threefold: it easily incorporates the combinatorial fragmentation trees
by kernelizing the model; it can query molecular structure databases,
which are much larger than MS/MS spectral libraries; and the predicted
molecular fingerprints can help to characterize the unknown metabolite
and may shed light on de novo identification.



2.1 Fragmentation trees 

Bocker and Rasche (2008) introduced fragmentation trees to predict the 
molecular formula of an unknown compound using its MS/MS spectra. 
A fragmentation tree annotates the MS/MS spectra of a compound via 
assumed fragmentation processes. Nodes are molecular formulas, repre- 
senting the unfragmented molecule and its fragments. Edges represent 
fragmentation reactions between fragments, or the unfragmented mol- 
ecule and a fragment. Details on the computation can be found in 
Bocker and Rasche (2008) and Rasche et al. (2011); here, we quickly 
recapitulate the method. We assume that MS/MS spectra recorded at 
different collision energies have been amalgamated into a single spectrum, 
as described in Section 3. We decompose all peaks in the amalgamated 
spectrum, finding all molecular formulas that are within the mass accur- 
acy of the measurement. For each decomposition of the parent peak, we 
build a fragmentation graph which contains all possible explanations for 
each peak, where nodes are colored by the peaks they originate from. We 
insert all edges between nodes that are not ruled out by the molecular 
formulas: that is, a product fragment can never gain atoms of any element 
through the fragmentation. Edges of this graph are then weighted, taking 
into account the intensity and mass accuracy of the product fragment, the 
mass of the loss and prior knowledge about the occurrence of certain 
losses. 
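The rule for inserting edges (a product fragment can never gain atoms of any element) amounts to a componentwise comparison of element counts. The following is only a minimal sketch of that rule, assuming molecular formulas are stored as dictionaries of element counts; the helper names `is_subformula`, `loss` and `candidate_edges` are hypothetical and not taken from the authors' software, and the coloring of nodes by peaks is left out.

```python
from collections import Counter

def is_subformula(child, parent):
    """True if `child` has no more atoms of any element than `parent`."""
    return all(parent.get(el, 0) >= n for el, n in child.items())

def loss(parent, child):
    """Molecular formula of the loss parent - child (assumes is_subformula)."""
    diff = Counter(parent)
    diff.subtract(child)
    return {el: n for el, n in diff.items() if n > 0}

def candidate_edges(nodes):
    """Index pairs (i, j) where formula j could be a fragment of formula i."""
    return [(i, j)
            for i, u in enumerate(nodes)
            for j, v in enumerate(nodes)
            if i != j and u != v and is_subformula(v, u)]

# Toy example: three formulas given as element-count dictionaries.
parent = {"C": 6, "H": 12, "O": 6}
frag_a = {"C": 6, "H": 10, "O": 5}   # parent minus H2O
frag_b = {"C": 5, "H": 10, "O": 4}
edges = candidate_edges([parent, frag_a, frag_b])
```

In this toy example every smaller formula is reachable from every larger one, so three edges are produced; edge weighting and the tree optimization itself are not part of the sketch.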



2.2 Kernels for fragmentation trees and MS/MS spectra 

2.2.1 Probability product kernel Heinonen et al. (2012) compared 
several kernels that can be computed directly from the MS/MS spectra 
without the knowledge of the fragmentation trees. In their studies, simple 
peak and loss matching kernels were found inferior to the probability 
product kernel (PPK). Thus, we use the PPK as the baseline comparison 
with the fragmentation tree kernels. The idea of the PPK is the following: 
each peak in a spectrum is modeled by a 2D Gaussian distribution with 
the mass-to-charge ratio as one dimension, and the intensity as the other. 
All-against-all matching between the Gaussians is performed to avoid 
problems arising from alignment errors. 

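Formally, a spectrum is defined as $x = \{x(1), \dots, x(\ell_x)\}$, a set of $\ell_x$
peaks $x(k) = (\mu(k), \iota(k)) \in \mathbb{R}^2$ $(k = 1, \dots, \ell_x)$ consisting of the peak mass
$\mu(k)$ and the normalized peak intensity $\iota(k)$. The $k$-th peak of the mass
spectrum $x$ is represented by $p_{x(k)} = \mathcal{N}(x(k), \Sigma)$, centered around the peak
measurement and with a covariance shared by all peaks

$$\Sigma = \begin{pmatrix} \sigma_\mu^2 & 0 \\ 0 & \sigma_\iota^2 \end{pmatrix},$$

where the variance $\sigma_\mu^2$ for the mass is estimated from data and $\sigma_\iota^2$ is
tuned by cross-validation. No covariance is assumed between peak
distributions. The spectrum $x$ is finally represented as a mixture of its peak
distributions, $p_x = \frac{1}{\ell_x} \sum_{k=1}^{\ell_x} p_{x(k)}$.

The PPK $K_{peaks}$ (Jebara et al., 2004) between the peaks of two spectra
$x, x'$ is given by:

$$K_{peaks}(x, x') = \int_{\mathbb{R}^2} p_x(z)\, p_{x'}(z)\, dz = \frac{1}{\ell_x \ell_{x'}} \frac{1}{4\pi\sigma_\mu\sigma_\iota} \sum_{k=1}^{\ell_x} \sum_{k'=1}^{\ell_{x'}} \exp\left(-\frac{1}{4} (x(k) - x'(k'))^T \Sigma^{-1} (x(k) - x'(k'))\right).$$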
The precursor ion is the compound selected in the first round of MS/MS
and further fragmented in the second round. As a result, the difference
(loss) between the peak $x(k)$ and the precursor ion $\mathrm{prec}(x) = (\mu(p), 0)$
is also important, where $\mu(p)$ is the mass of the precursor ion. We can
model the difference with the distribution $p_{\hat{x}(k)} = \mathcal{N}(\hat{x}(k), \Sigma)$, where
$\hat{x}(k) = |\mathrm{prec}(x) - x(k)|$. This feature is denoted as loss and the corresponding
kernel matrix as $K_{loss}$. Experiments in Heinonen et al. (2012) and Shen
et al. (2013) showed that the combined kernel $K_{peaks} + K_{loss}$ achieved the best
accuracy and computational efficiency among the spectral kernels.
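To make the closed form of the PPK concrete, the sketch below evaluates it for two small peak lists and derives the loss features from the precursor mass. It is only an illustration under stated assumptions: peaks are `(mass, intensity)` tuples, the absolute difference to the precursor is taken componentwise, and the function names and the example values of the two standard deviations are hypothetical.

```python
import math

def ppk(peaks_a, peaks_b, sigma_mass, sigma_int):
    """Probability product kernel between two lists of (mass, intensity) peaks."""
    norm = 1.0 / (4.0 * math.pi * sigma_mass * sigma_int
                  * len(peaks_a) * len(peaks_b))
    total = 0.0
    for m1, i1 in peaks_a:          # all-against-all matching of Gaussians
        for m2, i2 in peaks_b:
            d2 = ((m1 - m2) / sigma_mass) ** 2 + ((i1 - i2) / sigma_int) ** 2
            total += math.exp(-0.25 * d2)
    return norm * total

def loss_features(peaks, precursor_mass):
    """Differences |prec(x) - x(k)| with prec(x) = (precursor mass, 0)."""
    return [(abs(precursor_mass - m), abs(0.0 - i)) for m, i in peaks]

def combined_kernel(peaks_a, prec_a, peaks_b, prec_b, sigma_mass, sigma_int):
    """K_peaks + K_loss for two spectra."""
    k_peaks = ppk(peaks_a, peaks_b, sigma_mass, sigma_int)
    k_loss = ppk(loss_features(peaks_a, prec_a),
                 loss_features(peaks_b, prec_b), sigma_mass, sigma_int)
    return k_peaks + k_loss

spec_a = [(100.0, 0.5), (150.0, 0.5)]
spec_b = [(100.0, 0.4), (180.0, 0.6)]
k_ab = combined_kernel(spec_a, 200.0, spec_b, 200.0, 0.1, 0.2)
```

Because every pair of Gaussians is compared, no peak alignment step is needed; the cost is quadratic in the number of peaks per spectrum.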

2.2.2 Fragmentation tree kernels Fragmentation trees can be con- 
sidered as an annotated representation of the original MS/MS spectra. 
Recent advances (Rasche et al., 2012; Rojas-Cherto et al., 2012) in
comparing and aligning fragmentation trees enable similarity metrics
to be defined between fragmentation patterns for small molecules. Rasche 
et al. (2012) introduced fragmentation tree alignments, and showed align- 
ment scores to be correlated with chemical similarity. However, alignment 
scores of this type do not, in general, yield positive semidefinite kernels. 
In the following, we define a set of kernels for fragmentation trees that 
will allow us to transfer the power of the fragmentation tree approach to 
the kernel-based learning algorithms for molecular fingerprint prediction 
and metabolite identification. 

A fragmentation tree $T = (V, E)$ consists of a node set $V$ of molecular
formulas (corresponding to the fragments) and an edge set $E \subseteq V \times V$
(corresponding to the losses). Let $r$ denote the root of $T$. For an edge
$e = (u, v) \in E$ let $\lambda(e) = \lambda(u, v) := u - v$ be the molecular formula of the
corresponding loss. Clearly, different edges may have identical losses; let
$\lambda(E)$ be the multiset of all losses. For some loss molecular formula $l$, let
$N(l)$ be the number of edges $e \in E$ with $\lambda(e) = l$. Each path from the root
$r$ to a node $v$ implies a root loss $r - v$; let $\mathcal{L} := \{r - v : v \in V\}$ be the set of
all root losses. For an MS/MS spectrum $x$, let $T_x = (V_x, E_x)$ be the
corresponding fragmentation tree, with root losses $\mathcal{L}_x$ and loss multiplicities
$N_x(\cdot)$. For any node $v \in V_x$ let $\iota_x(v)$ be the corresponding peak intensity;
for an edge $e = (u, v) \in E_x$ let $\iota_x(e)$ be the intensity of the terminal node $v$.

For the loss- and node-based kernels, feature vectors $\phi$ are constructed
and the kernel function is simply the dot product of two feature
vectors. Path-based kernels are more complicated, and details on their
computation are given below.

Loss-based kernels: edges in the fragmentation trees represent the losses
from the parent node to the child node. The following feature vectors are
devised based on the losses in a fragmentation tree $T_x$:

• LB: Loss binary, indicates the presence of a loss $l$ in a fragmentation
tree $T_x$, that is, $\phi_l^{LB}(x) = \mathbb{1}_{l \in \lambda(E_x)}$.

• LC: Loss count, counts the number of occurrences of a loss $l$ in a
fragmentation tree $T_x$, that is, $\phi_l^{LC}(x) = N_x(l)$.

• LI: Loss intensity, uses the average intensity of the terminal
nodes with loss $l$ in a fragmentation tree $T_x$, that is,
$\phi_l^{LI}(x) = \frac{1}{N_x(l)} \sum_{e \in E_x,\, \lambda(e) = l} \iota_x(e)$ if $N_x(l) > 0$, and $\phi_l^{LI}(x) = 0$ otherwise.

• RLB: Root loss binary, indicates the presence of a root loss $l$ in a
fragmentation tree $T_x$, that is, $\phi_l^{RLB}(x) = \mathbb{1}_{l \in \mathcal{L}_x}$.

• RLI: Root loss intensity, uses the intensity of the terminal node of a
root loss if it is present in a fragmentation tree $T_x$. For root $r$ we set
$\phi_l^{RLI}(x) = \iota_x(r - l)$ if $r - l \in V_x$, and $\phi_l^{RLI}(x) = 0$ otherwise.

Node-based kernels: the nodes in the fragmentation tree explain peaks in
the MS/MS spectrum by the molecular formula of a hypothetical fragment. The
nodes are unique in a fragmentation tree $T$, and so are the root losses. To
this end, we can omit root losses from the feature vectors.

• NB: Nodes binary, indicates the presence of a node $v$ in a
fragmentation tree $T_x$, that is, $\phi_v^{NB}(x) = \mathbb{1}_{v \in V_x}$.

• NI: Nodes intensity, uses the intensity of the node if it is present
in a fragmentation tree $T_x$; that is, $\phi_v^{NI}(x) = \iota_x(v)$ for $v \in V_x$, and
$\phi_v^{NI}(x) = 0$ otherwise.
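Since the loss- and node-based kernels are dot products of sparse feature vectors, they can be sketched with plain dictionaries. The example below is a hedged illustration, not the authors' implementation: a tree is assumed to be a list of edges `(loss, child_formula, child_intensity)` with formulas written as strings, and all function names are hypothetical.

```python
from collections import Counter

def loss_count(edges):
    """LC: number of occurrences of each loss."""
    return Counter(l for l, _node, _inten in edges)

def loss_binary(edges):
    """LB: presence of each loss."""
    return {l: 1 for l in loss_count(edges)}

def loss_intensity(edges):
    """LI: average intensity of the terminal nodes of each loss."""
    sums = Counter()
    for l, _node, inten in edges:
        sums[l] += inten
    counts = loss_count(edges)
    return {l: sums[l] / counts[l] for l in counts}

def node_binary(root, edges):
    """NB: presence of each node (fragment formula), including the root."""
    return {n: 1 for n in [root] + [node for _l, node, _i in edges]}

def dot(phi_a, phi_b):
    """Sparse dot product of two feature dictionaries."""
    return sum(v * phi_b[k] for k, v in phi_a.items() if k in phi_b)

# Two toy trees sharing the root C6H12O6.
tree_a = [("H2O", "C6H10O5", 0.4), ("CO2", "C5H12O4", 0.2), ("H2O", "C5H10O3", 0.1)]
tree_b = [("H2O", "C6H10O5", 0.3), ("NH3", "C6H9O6", 0.5)]
k_lc = dot(loss_count(tree_a), loss_count(tree_b))
k_lb = dot(loss_binary(tree_a), loss_binary(tree_b))
k_nb = dot(node_binary("C6H12O6", tree_a), node_binary("C6H12O6", tree_b))
k_li = dot(loss_intensity(tree_a), loss_intensity(tree_b))
```

The root-loss variants (RLB, RLI) would follow the same pattern once each node is annotated with its difference to the root formula.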

Path-based kernels: these kernels count common paths between two
fragmentation trees; here, 'common path' refers to an identical sequence
of losses in the two trees. We use DP to efficiently count the number of
common paths, that is, the dot product of two feature vectors that are
not explicitly constructed. For two fragmentation trees $T_1 = (V_1, E_1)$ and
$T_2 = (V_2, E_2)$ we compute a DP table $D[u, v]$ for all $u \in V_1$ and $v \in V_2$. In
all cases, the number of common paths is $D[r_1, r_2]$ where $r_i$ is the root of
$T_i$. We initialize

$$D[u, v] = 0, \quad \forall u \in \mathcal{L}(T_1),\ v \in T_2$$

$$D[u, v] = 0, \quad \forall u \in T_1,\ v \in \mathcal{L}(T_2)$$

where $\mathcal{L}(T)$ denotes the leaves of a tree $T$. Let $\mathcal{C}(v)$ be the children of a
node $v$.

• Common path counting (CPC). The DP table entry $D[u, v]$ records the
count of common paths for the subtrees rooted in $u$ and $v$, respectively.
This leads to the following recurrence:

$$D[u, v] = \sum_{a \in \mathcal{C}(u),\, b \in \mathcal{C}(v),\, \lambda(u, a) = \lambda(v, b)} (1 + D[a, b]).$$

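• Common paths of length 2 (CP2). In this case, only common losses
for paths of length two are considered:

$D[u, v] = \sum (1 + D[a, b])$, where the sum runs over children $a \in \mathcal{C}(u)$, $b \in \mathcal{C}(v)$ with $\lambda(u, a) = \lambda(v, b)$ that additionally satisfy a second loss equality on the adjacent edges, so that only matching pairs of consecutive losses contribute.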

• Common path with $K_{peaks}$ (CPK). Instead of simply counting the
common paths, we use the PPK $K_{peaks}$ to score the terminal peaks.
We omit the straightforward but somewhat tedious details.

• Common subtree counting (CSC). In this case, we count the number
of 'common subtrees' between $T_1$ and $T_2$, which can be defined
analogously to the common paths above. Entry $D[u, v]$ now counts
the number of common subtrees for the two subtrees rooted in $u$ of
$T_1$ and $v$ of $T_2$. We have to consider three cases: for each pair of
children $a \in \mathcal{C}(u)$ and $b \in \mathcal{C}(v)$ with $\lambda(u, a) = \lambda(v, b)$ we can either
attach the subtrees rooted in $a$ and $b$; we can use solely the edges
$(u, a)$ and $(v, b)$ as a common subtree; or we can attach no common
subtree for this pair of children. But if we choose no subtree for all
matching pairs of children, the result would be a tree without edges
and, hence, not a valid common subtree. Thus, we have to correct for
this case by subtracting one. Hence, the recurrence is:

$$D[u, v] = \prod_{a \in \mathcal{C}(u),\, b \in \mathcal{C}(v),\, \lambda(u, a) = \lambda(v, b)} (2 + D[a, b]) - 1.$$
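The CPC and CSC recurrences can be sketched with memoized recursion instead of an explicit table. This is an illustrative sketch under assumptions: each tree is a dictionary mapping a node to its list of `(loss, child)` pairs, leaves are simply absent from the dictionary, and the function names are hypothetical.

```python
from functools import lru_cache

def common_paths(t1, r1, t2, r2):
    """CPC: D[u, v] = sum over matching child pairs of (1 + D[a, b])."""
    @lru_cache(maxsize=None)
    def D(u, v):
        return sum(1 + D(a, b)
                   for la, a in t1.get(u, ())
                   for lb, b in t2.get(v, ())
                   if la == lb)          # leaves give an empty sum, i.e. 0
    return D(r1, r2)

def common_subtrees(t1, r1, t2, r2):
    """CSC: D[u, v] = product over matching child pairs of (2 + D[a, b]), minus 1."""
    @lru_cache(maxsize=None)
    def D(u, v):
        prod = 1
        for la, a in t1.get(u, ()):
            for lb, b in t2.get(v, ()):
                if la == lb:
                    prod *= 2 + D(a, b)
        return prod - 1                  # remove the edgeless choice
    return D(r1, r2)

# Two toy trees with identical loss structure: H2O then CO, plus CO2.
tree_1 = {"r": [("H2O", "a"), ("CO2", "b")], "a": [("CO", "c")]}
tree_2 = {"s": [("H2O", "x"), ("CO2", "y")], "x": [("CO", "z")]}
n_paths = common_paths(tree_1, "r", tree_2, "s")
n_subtrees = common_subtrees(tree_1, "r", tree_2, "s")
```

In the toy example the three common paths from the root are H2O, H2O followed by CO, and CO2; the five common subtrees are the non-empty combinations of these branches.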

2.3 MKL 

In many applications, multiple kernels from different kernel functions or 
multiple sources of information are available. MKL becomes a natural 
way to combine information contained in the kernels. Instead of choosing 
the best kernel via cross-validation as in Heinonen et al. (2012) and Shen 
et al. (2013), MKL seeks a linear, convex or even non-linear combination 
of the kernels. An overview of MKL algorithms can be found in a survey 
by Gonen and Alpaydin (2011). 

In practice, it is often difficult for MKL algorithms to outperform the
uniform combination of the kernels (UNIMKL), in which all kernels receive
equal weights. However, in some cases, some methods have shown
improvements over the uniform combination. Three algorithms coupled with
SVM are considered in the following: centered alignment-based algorithms
(Cortes et al., 2012), quadratic combination of the kernels (Li and Sun,
2010) and $\ell_p$-norm ($p > 1$) constraints on the kernel weights (Kloft et al., 2011).

For all three algorithms, the input is a set of kernels
$\mathcal{K} = \{K_k \mid K_k \in \mathbb{R}^{n \times n},\ k = 1, \dots, q\}$ computed from $n$ data points. The
output is a set of $m$ fingerprint properties $Y \in \{-1, +1\}^{n \times m}$. This is a
multi-label prediction task, and each label is trained independently in the
experiments.

2.3.1 Centered alignment-based MKL The centered alignment-based
MKL algorithms are based on the observation that the centered
alignment score with the target kernel $K_y = yy^T$ correlates very well with
the performance of the kernel, where $y$ is a single label. Experiments by
Cortes et al. (2012) show consistent improvements over the uniform
combination. In the molecular fingerprint prediction setting, the target kernel
is defined as $K_Y = YY^T$.

Two-stage models are considered, in which the kernel weights are
learned first and can then be applied with any kernel-based learning
algorithm (SVM in this work). The centered kernel matrices are defined by
Equation (1):

$$K_c = \left[I - \frac{ee^T}{n}\right] K \left[I - \frac{ee^T}{n}\right] \qquad (1)$$

where $I$ is the identity matrix and $e$ is the vector with all ones.
For all $A, B \in \mathbb{R}^{n \times n}$, let $\langle \cdot, \cdot \rangle_F$ denote the Frobenius product and $\|\cdot\|_F$
the Frobenius norm, which are defined by

$$\langle A, B \rangle_F = \mathrm{Tr}[A^T B] \quad \text{and} \quad \|A\|_F = \sqrt{\langle A, A \rangle_F}.$$



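Let now $K \in \mathbb{R}^{n \times n}$ and $K' \in \mathbb{R}^{n \times n}$ be two kernel matrices such that
$\|K_c\|_F \ne 0$ and $\|K'_c\|_F \ne 0$. Then the centered alignment between $K$
and $K'$ is defined by

$$\rho(K, K') = \frac{\langle K_c, K'_c \rangle_F}{\|K_c\|_F \, \|K'_c\|_F}. \qquad (2)$$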



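The simple independent centered alignment-based algorithm (ALIGN)
(Cortes et al., 2012) computes the alignment score between each kernel
matrix $K_k$ and the target kernel matrix $K_Y$ and combines the kernels as

$$K_\mu \propto \sum_{k=1}^{q} \rho(K_k, K_Y)\, K_k = \frac{1}{\|K_Y\|_F} \sum_{k=1}^{q} \frac{\langle K_k, K_Y \rangle_F}{\|K_k\|_F}\, K_k,$$

where the matrices inside the Frobenius products and norms are the centered ones of Equation (1).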
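The centering, the centered alignment and the ALIGN weighting can be sketched with nested lists. This is a minimal illustration under assumptions, not the authors' code: kernels are small dense matrices given as lists of rows, `Y` is a list of label rows with entries in {-1, +1}, the weights are left unnormalized, and all function names are hypothetical.

```python
import math

def center(K):
    """(I - ee^T/n) K (I - ee^T/n), written entrywise."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    tot = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + tot for j in range(n)] for i in range(n)]

def frob(A, B):
    """Frobenius product <A, B>_F = Tr[A^T B]."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def alignment(K1, K2):
    """Centered alignment; both centered matrices must be non-zero."""
    A, B = center(K1), center(K2)
    return frob(A, B) / math.sqrt(frob(A, A) * frob(B, B))

def align_combination(kernels, Y):
    """ALIGN: weight each kernel by its alignment with the target kernel YY^T."""
    n = len(Y)
    KY = [[sum(a * b for a, b in zip(Y[i], Y[j])) for j in range(n)] for i in range(n)]
    w = [alignment(K, KY) for K in kernels]
    K_mu = [[sum(wk * K[i][j] for wk, K in zip(w, kernels)) for j in range(n)]
            for i in range(n)]
    return w, K_mu

Y = [[1], [1], [-1], [-1]]
K_ideal = [[1, 1, -1, -1], [1, 1, -1, -1], [-1, -1, 1, 1], [-1, -1, 1, 1]]
K_identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
weights, K_mu = align_combination([K_ideal, K_identity], Y)
```

A kernel that reproduces the label structure exactly receives weight 1, whereas the uninformative identity kernel receives a smaller weight.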



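The alignment maximization algorithm (ALIGNF) (Cortes et al., 2012)
jointly seeks the weights $\mu_k$ that maximize the alignment score defined by
Equation (2) between the convex combination of the kernels in $\mathcal{K}$ and the
target kernel $K_y = yy^T$, that is, it solves the following optimization problem:

$$\max_{\mu \in \mathcal{M}} \frac{\langle K_{\mu c}, K_y \rangle_F}{\|K_{\mu c}\|_F},$$

where $K_{\mu c}$ is the centered version of $K_\mu = \sum_{k=1}^{q} \mu_k K_k$ and $\mathcal{M} = \{\mu : \|\mu\|_2 = 1,\ \mu \ge 0\}$.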



2.3.2 Quadratic combination MKL In this setting, the quadratic
combination of kernels (QCMKL) is included in the formulation and the
MKL problem is solved by semidefinite programming (Lanckriet et al.,
2002; Li and Sun, 2010). The kernels in $\mathcal{K}$ are enriched to a new set
$\tilde{\mathcal{K}} = \{\tilde{K}_t \mid t = 1, \dots, q(q+1)/2\}$ by the following transformation:

$$\tilde{K}_{\langle i, j \rangle} = \begin{cases} K_i \circ K_j & i \ne j \\ K_i & i = j \end{cases}$$

where $i, j = 1, \dots, q$ and $\circ$ denotes the Hadamard product.
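The enrichment step is easy to state in code: every unordered pair of kernels contributes one matrix, either the kernel itself or an elementwise product. The sketch below assumes kernels are lists of rows; `hadamard` and `enrich` are hypothetical helper names.

```python
from itertools import combinations_with_replacement

def hadamard(A, B):
    """Elementwise (Hadamard) product of two equally sized matrices."""
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def enrich(kernels):
    """Return the q(q+1)/2 kernels K_i (i = j) and K_i o K_j (i != j)."""
    out = []
    for i, j in combinations_with_replacement(range(len(kernels)), 2):
        out.append(kernels[i] if i == j else hadamard(kernels[i], kernels[j]))
    return out

K1 = [[2.0, 1.0], [1.0, 2.0]]
K2 = [[1.0, 0.5], [0.5, 1.0]]
K3 = [[3.0, 0.0], [0.0, 3.0]]
enriched = enrich([K1, K2, K3])
```

With q = 3 input kernels the enriched set has six members, ordered as the pairs (1,1), (1,2), (1,3), (2,2), (2,3), (3,3). The Hadamard product of two positive semidefinite matrices is again positive semidefinite (Schur product theorem), so every member is a valid kernel.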



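The convex combination of the kernels is given by $K_\mu = \sum_{t=1}^{q(q+1)/2} \mu_t \tilde{K}_t$
with $\mu \ge 0$ and $e^T\mu = 1$. Adapting the soft margin SVM formulation
reveals the following dual problem (in epigraph form) (Li and
Sun, 2010):

$$\min_{\mu,\, u} \; u$$

$$\text{s.t.} \quad u \ge \alpha^T e - \frac{1}{2} \alpha^T G(K_\mu)\, \alpha \quad \text{for all } \alpha \text{ with } 0 \le \alpha \le Ce,\ \alpha^T y = 0,$$

$$\mu \ge 0, \quad e^T\mu = 1.$$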



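The derived Lagrangian for the problem is (Li and Sun, 2010):

$$L(\alpha, \beta, \delta, \gamma) = \alpha^T e - \frac{1}{2} \alpha^T G(K_\mu)\, \alpha + \beta^T \alpha + \gamma\, \alpha^T y + \delta^T (Ce - \alpha)$$

with $\beta \ge 0$, $\delta \ge 0$ and $\gamma$ as dual variables, and $G(K) = \mathrm{diag}(y)\, K\, \mathrm{diag}(y)$.
Applying the Schur complement lemma to convert the first inequality constraint
into a Linear Matrix Inequality (LMI) yields the following semidefinite
program (SDP) (Li and Sun, 2010):

$$\min_{u,\, \mu,\, \beta,\, \delta,\, \gamma} \; u$$

$$\text{s.t.} \quad \begin{pmatrix} G(K_\mu) & e + \beta + \gamma y - \delta \\ (e + \beta + \gamma y - \delta)^T & u - 2C\delta^T e \end{pmatrix} \succeq 0,$$

$$\mu \ge 0, \quad e^T\mu = 1, \quad \beta \ge 0, \quad \delta \ge 0.$$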



Many standard SDP tools, such as CVX (http://cvxr.com/), can be used to
find the optimal solution.

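2.3.3 $\ell_p$-norm MKL While the $\ell_1$-norm on the kernel weights
$\mu$ produces sparse solutions, higher norms $p > 1$ produce non-sparse
solutions, which may be beneficial. A general framework
for $\ell_p$-norm MKL ($\ell_p$-MKL) was proposed by Kloft et al. (2011).
The $q$ kernels correspond to $q$ feature mappings
$\psi_k : \mathcal{X} \to \mathcal{H}_k$, $k = 1, \dots, q$, and $\ell$ is some convex loss function. The
primal problem is then:

$$\min_{w,\, b,\, \mu} \; C \sum_{i=1}^{n} \ell\left( \sum_{k=1}^{q} \langle w_k, \psi_k(x_i) \rangle_{\mathcal{H}_k} + b,\; y_i \right) + \frac{1}{2} \sum_{k=1}^{q} \frac{\|w_k\|_{\mathcal{H}_k}^2}{\mu_k}$$

$$\text{s.t.} \quad \mu \ge 0, \quad \|\mu\|_p^2 \le 1.$$

When the optimization is coupled with the hinge loss, the problem has a
simple dual form (Kloft et al., 2011):

$$\max_{\alpha} \; \alpha^T e - \frac{1}{2} \left\| \left( \alpha^T G(K_k)\, \alpha \right)_{k=1}^{q} \right\|_{p^*},$$

where all variables are as defined before and $p^* = \frac{p}{p-1}$.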

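The optimization problem can be solved by alternating between the dual
variables $\alpha$ and the kernel weights $\mu$, which are linked through the
squared norms of $w$ by the following equations:

$$\|w_k\|^2 = \mu_k^2\, \alpha^T K_k\, \alpha, \quad \forall k = 1, \dots, q \qquad (3)$$

$$\mu_k = \frac{\|w_k\|^{2/(p+1)}}{\left( \sum_{k'=1}^{q} \|w_{k'}\|^{2p/(p+1)} \right)^{1/p}}, \quad \forall k = 1, \dots, q \qquad (4)$$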



Based on the above equations, a simple alternating algorithm has been
proposed by Kloft et al. (2011) as Algorithm 1.

Algorithm 1 Wrapper algorithm for $\ell_p$-norm MKL
Input: feasible $\alpha$ and $\mu$

while optimality conditions are not satisfied do
    Solve for $\alpha$ with the current $\mu$ using a standard SVM.
    Compute $\|w_k\|^2$ with Equation (3).
    Update $\mu$ by Equation (4).

end while
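The two closed-form steps inside the loop of Algorithm 1 can be sketched as follows. This is a hedged illustration only: the weight update follows the analytical update of Kloft et al. (2011) as understood here, the SVM solve is omitted entirely, the vector `a` is assumed to be whatever dual vector Equation (3) is evaluated with, and the function names are hypothetical.

```python
def quad_form(a, K):
    """a^T K a for a vector a and a square matrix K (lists of rows)."""
    n = len(a)
    return sum(a[i] * K[i][j] * a[j] for i in range(n) for j in range(n))

def squared_norms(a, kernels, mu):
    """Equation (3): ||w_k||^2 = mu_k^2 * a^T K_k a for every kernel."""
    return [m * m * quad_form(a, K) for m, K in zip(mu, kernels)]

def update_weights(norms_sq, p):
    """Equation (4): closed-form update, which yields ||mu||_p = 1."""
    norms = [s ** 0.5 for s in norms_sq]
    denom = sum(w ** (2.0 * p / (p + 1)) for w in norms) ** (1.0 / p)
    return [w ** (2.0 / (p + 1)) / denom for w in norms]

kernels = [[[2.0, 0.0], [0.0, 2.0]], [[1.0, 0.5], [0.5, 1.0]]]
a = [0.5, 1.0]
mu = [0.5 ** 0.5, 0.5 ** 0.5]          # feasible start for p = 2
norms_sq = squared_norms(a, kernels, mu)
mu_new = update_weights(norms_sq, p=2)
```

After the update the weights lie on the unit $\ell_p$ sphere, and kernels with a larger squared norm receive larger weights; in a full implementation this step would alternate with retraining the SVM on the re-weighted kernel sum.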



The optimality conditions can be based on the change of the objective
function between two subsequent iterations or on the duality gap. More
detailed theoretical results and a faster chunking-based algorithm are also
presented in Kloft et al. (2011).

2.4 Probabilistic scoring of candidate metabolites 

Given a predicted fingerprint associated with a mass spectrum, for
metabolite identification, we need to retrieve metabolites with similar
fingerprints from a molecular database. Let $\hat{y} \in \{-1, +1\}^m$ be a predicted
fingerprint and $y \in \{-1, +1\}^m$ the fingerprint of some molecule
in a molecular database. The candidate $y$ can be scored by the following
equation, as used in FingerID (Heinonen et al., 2012; Shen et al., 2013):

$$P_{PB}(y, \hat{y}) = \prod_{i=1}^{m} \gamma_i^{\mathbb{1}[y_i = \hat{y}_i]} (1 - \gamma_i)^{\mathbb{1}[y_i \ne \hat{y}_i]},$$

that is, the Poisson binomial probability for the fingerprint vector $y$, where
the cross-validation accuracies $(\gamma_i)_{i=1}^{m} \in [0.5, 1]^m$ of the fingerprint
predictions are taken as the reliability scores.
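The scoring rule multiplies, over all fingerprint bits, the reliability of a bit when candidate and prediction agree and one minus the reliability when they disagree. The sketch below works in log space for numerical stability; the clipping constant, the candidate names and the function names are assumptions made for illustration.

```python
import math

def log_score(candidate, predicted, reliability, eps=1e-12):
    """Log of the product of gamma_i (match) or 1 - gamma_i (mismatch)."""
    total = 0.0
    for y, y_hat, g in zip(candidate, predicted, reliability):
        prob = g if y == y_hat else 1.0 - g
        total += math.log(max(prob, eps))   # clip to avoid log(0) when g = 1
    return total

def rank_candidates(candidates, predicted, reliability):
    """Candidate names sorted from best to worst score."""
    return sorted(candidates,
                  key=lambda name: log_score(candidates[name], predicted, reliability),
                  reverse=True)

predicted = [1, -1, 1, 1]
reliability = [0.9, 0.8, 0.6, 0.7]          # cross-validation accuracies
candidates = {
    "A": [1, -1, 1, 1],                     # agrees on every bit
    "B": [1, -1, -1, 1],                    # disagrees on a low-reliability bit
    "C": [-1, -1, 1, 1],                    # disagrees on a high-reliability bit
}
ranking = rank_candidates(candidates, predicted, reliability)
```

A mismatch on a reliably predicted bit is penalized more heavily than a mismatch on an unreliable one, which is why candidate B ranks above candidate C here.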

3 RESULTS 

Two MS/MS datasets were used: 978 compounds downloaded from
METLIN (Tautenhahn et al., 2012) and 402 compounds from
MassBank (Hisayuki et al., 2010), both measured on QTOF
MS/MS instruments. For each compound, mass spectra
recorded at different collision energies were amalgamated
before further processing: we normalize MS/MS spectra such
that intensities sum up to 100%. We merge peaks from different
collision energies with m/z difference at most 0.1, using the m/z of
the highest peak and summing up intensities. We discard all but
the 30 highest peaks, as well as peaks with relative intensity
<0.5%.
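The amalgamation steps can be sketched as follows. Several details are assumptions, since the text does not fix them: peaks are merged greedily in order of decreasing intensity, relative intensity is measured against the most intense merged peak, and the function names are hypothetical.

```python
def normalize(spectrum):
    """Scale intensities of one spectrum so that they sum to 100."""
    total = sum(inten for _mz, inten in spectrum)
    return [(mz, 100.0 * inten / total) for mz, inten in spectrum]

def amalgamate(spectra, tol=0.1, max_peaks=30, min_rel=0.5):
    """Merge spectra from several collision energies into one peak list."""
    peaks = sorted((p for s in spectra for p in normalize(s)),
                   key=lambda p: -p[1])
    merged = []                              # entries: [m/z of highest peak, summed intensity]
    for mz, inten in peaks:
        for entry in merged:
            if abs(entry[0] - mz) <= tol:    # close enough: add to existing peak
                entry[1] += inten
                break
        else:
            merged.append([mz, inten])
    top = max(inten for _mz, inten in merged)
    kept = [(mz, inten) for mz, inten in merged if 100.0 * inten / top >= min_rel]
    kept.sort(key=lambda p: -p[1])
    return sorted(kept[:max_peaks])          # report by increasing m/z

low_energy = [(100.00, 50.0), (150.02, 50.0)]
high_energy = [(100.05, 80.0), (200.00, 19.9), (250.00, 0.1)]
merged_spectrum = amalgamate([low_energy, high_energy])
```

In the example the two peaks near m/z 100 are merged under the m/z of the more intense one, and the weak peak at m/z 250 falls below the relative intensity threshold.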

Next, we compute the fragmentation tree. We assume that we 
can identify the correct molecular formula from the data: limit- 
ing candidate molecular formulas to those present in KEGG 
(Kanehisa and Goto, 2000), which is used for searching molecu- 
lar structures below, the best scoring fragmentation tree identi- 
fied the correct molecular formula of the compound in 97.1% 
(96.0%) of the cases for the METLIN (MassBank) dataset. 
Integrating other sources of information such as MS1 isotope
patterns (Bocker et al., 2009) or retention times would reach 
even better identification rates. To allow for a meaningful com- 
parison of the power of the different kernels, we therefore use the 
best scoring fragmentation tree of the correct compound molecu- 
lar formula. 

All 11 fragmentation tree kernels proposed in the previous
section were computed, along with PPK used in Heinonen 
et al. (2012) and Shen et al. (2013) computed directly from 
MS/MS, resulting in 12 kernels to be evaluated. 

Molecular fingerprints were generated using OpenBabel
(O'Boyle et al., 2011) which contains four types of fingerprints
(http://openbabel.org/wiki/Tutorial:Fingerprints). FP3, FP4 and



Table 1. Micro-average performance of individual kernels

            METLIN                     MassBank
Kernel      Acc (%)       F1 (%)       Acc (%)       F1 (%)
LB          79.5 ± 0.5    69.9 ± 0.9   78.9 ± 1.0    69.0 ± 2.2
LC          79.4 ± 0.3    69.6 ± 0.4   78.5 ± 1.2    68.4 ± 2.7
LI          77.8 ± 0.5    66.8 ± 0.7   77.4 ± 1.0    66.7 ± 2.0
RLB         81.6 ± 0.8    73.2 ± 1.1   78.6 ± 1.0    68.4 ± 1.2
RLI         78.4 ± 0.6    68.5 ± 0.8   76.7 ± 0.9    65.4 ± 1.6
NB          81.9 ± 0.4    73.9 ± 0.3   81.4 ± 0.7    73.2 ± 1.2
NI          80.3 ± 0.7    71.1 ± 0.8   79.8 ± 1.0    70.5 ± 0.9
CPC         80.6 ± 0.5    71.6 ± 0.7   78.7 ± 1.4    68.9 ± 2.4
CP2         78.7 ± 0.7    68.4 ± 1.2   76.4 ± 1.0    65.5 ± 1.1
CPK         72.9 ± 0.3    58.8 ± 0.5   72.2 ± 0.6    57.9 ± 0.5
CSC         74.9 ± 0.4    61.9 ± 0.8   77.8 ± 0.8    67.2 ± 2.0
PPK         76.7 ± 0.6    64.0 ± 0.7   72.9 ± 1.1    58.6 ± 1.2

PPK is the method from Heinonen et al. (2012), which we compare against.

MACCS fingerprints (528 bits in total) were generated based on
the SMARTS patterns predefined in the software. In our dataset, more
than half of the fingerprint properties have a high class bias,
with a large majority of the compounds belonging either to the positive
class (most compounds match the property) or to the negative
class (most compounds do not match the property). For
such fingerprints, the default classifier, which always predicts
the majority class, has high accuracy, although the model is not
meaningful. For our performance comparisons, we opted to only
include fingerprints with class bias rate <0.9.

For each fingerprint property, we separately trained an SVM;
for all properties, we used identical training and testing
compounds. Five-fold cross-validation was performed and the
SVM margin softness parameter ($C \in \{2^{-3}, 2^{-2}, \dots, 2^{6}, 2^{7}\}$)
was tuned based on the training accuracy.

3.1 Fingerprint prediction performance 

The micro-average (simultaneous average over fingerprint properties
and compounds) accuracy and F1 of the individual kernels
on the predictions of fingerprint properties with bias rate <0.9
are shown in Table 1, with the SDs computed from different
cross-validation folds. The kernel NB achieves the best accuracy
and F1 on both METLIN and MassBank. Compared with the
PPK, the fragmentation tree kernels are markedly more accurate
on average.
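The evaluation protocol above combines a class bias filter with micro-averaged metrics, that is, all (compound, property) pairs are pooled before accuracy and F1 are computed. The sketch below illustrates both under assumptions: labels are in {-1, +1}, the bias rate is taken as the fraction of compounds in the majority class, and the function names are hypothetical.

```python
def class_bias(labels):
    """Fraction of compounds in the majority class of one fingerprint property."""
    pos = sum(1 for v in labels if v == 1)
    return max(pos, len(labels) - pos) / len(labels)

def micro_scores(y_true, y_pred):
    """Micro-averaged accuracy and F1 over all compounds and properties."""
    tp = fp = fn = tn = 0
    for row_t, row_p in zip(y_true, y_pred):
        for t, p in zip(row_t, row_p):
            if t == 1 and p == 1:
                tp += 1
            elif t == -1 and p == 1:
                fp += 1
            elif t == 1 and p == -1:
                fn += 1
            else:
                tn += 1
    acc = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return acc, f1

# Rows are compounds, columns are fingerprint properties.
y_true = [[1, -1, 1], [1, 1, -1]]
y_pred = [[1, -1, -1], [1, -1, -1]]
acc, f1 = micro_scores(y_true, y_pred)
bias_first_property = class_bias([row[0] for row in y_true])
```

In this toy case the first property is matched by every compound, so its bias rate is 1.0 and it would be excluded by the <0.9 threshold.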

The improvement of MKL approaches over single kernel
SVMs is clear. The t-test between NB and ALIGNF shows that the
differences in mean accuracy and F1 are indeed very significant,
with P-values of $4 \times 10^{-6}$ and $1.7 \times 10^{-3}$, respectively. The kernel
weights learned by the different MKL algorithms are shown in the
supplementary file.

The micro-average accuracy and F1 of the MKL algorithms
on the fingerprint property predictions are shown in Table 2,
from which it can be concluded that, averaged over all fingerprints,
the MKL methods perform quite similarly. We conducted further
pairwise difference testing, where the performance difference of each
pair of methods on each individual fingerprint property is evaluated.
Table 3 shows the significance level of the sign test on the
accuracy and F1 on the METLIN and MassBank datasets using
the different MKL methods. The sign test describes whether one
of the methods has higher probability of success (better than the
other on a fingerprint) than the other (alternative hypothesis) or


Table 2. Micro-average performance of MKL algorithms

                METLIN                     MassBank
Method          Acc (%)       F1 (%)       Acc (%)       F1 (%)
UNIMKL          85.0 ± 0.6    78.3 ± 0.7   82.2 ± 0.6    73.9 ± 1.5
ALIGN           85.2 ± 0.6    78.6 ± 0.7   82.4 ± 0.7    74.4 ± 1.4
ALIGNF          85.0 ± 0.5    78.6 ± 0.4   82.8 ± 0.4    75.2 ± 1.2
QCMKL           84.9 ± 0.5    77.8 ± 0.5   82.1 ± 0.6    74.0 ± 0.7
$\ell_2$-MKL    84.7 ± 0.5    77.5 ± 0.5   82.2 ± 0.5    74.0 ± 0.9
$\ell_3$-MKL    85.2 ± 0.6    78.5 ± 0.7   82.4 ± 0.6    74.4 ± 1.3
$\ell_4$-MKL    85.2 ± 0.6    78.5 ± 0.8   82.3 ± 0.6    74.2 ± 1.0
$\ell_5$-MKL    85.1 ± 0.6    78.5 ± 0.7   82.3 ± 0.6    74.1 ± 1.3



not (null hypothesis). From the table, we can deduce that ALIGN
and ALIGNF rise slightly above the competition whereas $\ell_2$-MKL
and QCMKL are slightly inferior to the rest. The performance of
UNIMKL is also respectable. The scatter plots of accuracy and F1
between every pair of the MKL algorithms are shown in the
supplementary file.
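The pairwise comparison can be illustrated with an exact sign test on per-fingerprint scores. This is a sketch under assumptions: ties are dropped, the test is taken as two-sided, and the function name and the example scores are made up for illustration rather than taken from the experiments.

```python
from math import comb

def sign_test(scores_a, scores_b):
    """Exact two-sided sign test; returns (wins of a, wins of b, p-value)."""
    wins = sum(a > b for a, b in zip(scores_a, scores_b))
    losses = sum(a < b for a, b in zip(scores_a, scores_b))
    n = wins + losses                       # ties carry no information
    k = max(wins, losses)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return wins, losses, min(1.0, 2 * tail)

# Hypothetical per-fingerprint accuracies of two methods on ten properties.
method_a = [0.81, 0.85, 0.90, 0.78, 0.88, 0.92, 0.80, 0.86, 0.83, 0.79]
method_b = [0.80, 0.84, 0.89, 0.77, 0.87, 0.91, 0.79, 0.85, 0.82, 0.80]
wins, losses, p_value = sign_test(method_a, method_b)
```

With nine wins out of ten the p-value is about 0.021, which would correspond to a single '+' under the convention of Table 3.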

3.2 Metabolite identification performance 

The molecular fingerprint prediction can serve as an intermediate
step for metabolite identification, and can be used to search a
molecular structure database (Heinonen et al., 2012; Shen et al.,
2013). We want to evaluate whether improvements in fingerprint
prediction propagate to better metabolite identification. We will
search for molecular structures from the KEGG database. As we
assume to know the correct molecular formula, we may filter 
based on this information to generate our candidate lists. But it 
turns out that this filter is too strict for a meaningful evaluation, 
as the number of candidates for each MS/MS spectrum becomes 
very small and, hence, all kernels show good performance. For a 



Table 3. Sign test for the performance of MKL algorithms on the METLIN and MassBank datasets 



MassBank 



METLIN 



MassBank 



Acc 


UNIMKL 


ALIGN 


ALIGNF 


QCMKL 


£ 2 -MKL 


£ 3 -MKL 


£ 4 -MKL 


UNIMKL 








+ + 


+ + 






ALIGN 


+ + 






+ + 


+ + 


+ 




ALIGNF 


+ 






+ + 


+ + 






QCMKL 
£ 2 -MKL 
£ 3 -MKL 










+ + 












+ + 


+ + 






£ 4 -MKL 


+ + 






+ + 


+ + 






£ 5 -MKL 


+ + 






+ + 


+ + 






UNIMKL 








+ 


+ 






ALIGN 


+ 






+ 


+ + 




+ + 


ALIGNF 


+ + 


+ + 




+ + 


+ + 


+ + 


+ + 


QCMKL 
















£ 2 -MKL 
















ℓ3-MKL 








+ 






+ 


ℓ4-MKL 








+ + 


+ 






ℓ5-MKL 








+ 


+ 






F1 


UNIMKL 


ALIGN 


ALIGNF 


QCMKL 


ℓ2-MKL 


ℓ3-MKL 


ℓ4-MKL 


UNIMKL 










+ 






ALIGN 


+ 






+ + 


+ + 






ALIGNF 








+ 


+ + 






QCMKL 
ℓ2-MKL 
ℓ3-MKL 










+ 






+ + 






+ + 


+ + 






ℓ4-MKL 


+ + 






+ + 


+ + 






ℓ5-MKL 


+ + 








+ + 






UNIMKL 
















ALIGN 


+ 






+ + 


+ + 






ALIGNF 








+ + 


+ + 






QCMKL 
















ℓ2-MKL 
















ℓ3-MKL 








+ + 








ℓ4-MKL 








+ + 








ℓ5-MKL 








+ 









£^-MKL 



'+' indicates that the method in the row is better than the method in the column ('−' otherwise) with a P-value between 0.01 and 0.05; blank indicates no significance. 
Similarly, '+ +' and '− −' indicate significance with P-value < 0.01. The upper table is for accuracy and the lower table is for F1. 
Fig. 2. (a and b) Identification performance when searching KEGG using a 300-ppm mass window with predicted molecular fingerprints, 
with fingerprints trained on the METLIN and MassBank datasets, respectively. NUM denotes the number of candidate molecules returned per query. 
(c and d) Proportion of data correctly identified at rank 1 for a series of mass windows 



more discriminative evaluation of the kernels, we artificially 
enlarge the set of candidates: we use all molecular structures in 
KEGG with mass accuracy window [M − Δ, M + Δ] as 
candidates, where M is the true mass of the unknown 
molecule. For sufficiently large mass accuracy Δ, this results 
in candidate lists that allow a meaningful comparison of the 
kernels. 
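
The candidate generation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the toy database of (name, mass) pairs and the conversion of a ppm tolerance into an absolute half-width Δ relative to the true mass are all choices made for this example, and the compound names are placeholders rather than KEGG entries.

```python
def candidates_in_window(database, true_mass, ppm):
    """Return names of database entries whose mass lies in [M - delta, M + delta].

    delta is derived from a relative tolerance given in parts per million.
    """
    delta = true_mass * ppm * 1e-6
    return [name for name, mass in database if abs(mass - true_mass) <= delta]

# Toy database with placeholder names and masses (illustration only)
toy_database = [
    ("cand_A", 180.0634),
    ("cand_B", 180.0423),
    ("cand_C", 180.1000),
    ("cand_D", 181.0000),
]

narrow = candidates_in_window(toy_database, 180.0634, ppm=20)
wide = candidates_in_window(toy_database, 180.0634, ppm=300)
print(len(narrow), len(wide))
```

Widening the window from 20 to 300 ppm enlarges the candidate list in this toy setting, which is the effect exploited in the evaluation.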

For identification, we want the true molecular structure to be 
ranked as high as possible in the candidate list. Figure 2a and b 
show, for the two datasets, the fraction of compounds whose true 
structure is ranked higher than a given rank when KEGG is searched 
with a 300-ppm mass window to generate the candidates. 
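
A cumulative rank curve of this kind can be computed as in the following minimal sketch. The function name, the inclusive cutoff (rank ≤ k) and the toy ranks are assumptions made for illustration; they are not taken from our experiments.

```python
def top_k_fraction(ranks, k):
    """Fraction of queries whose true structure is ranked at position k or better."""
    return sum(r <= k for r in ranks) / len(ranks)

# Rank of the true structure for eight hypothetical queries (1 = top position)
toy_ranks = [1, 3, 1, 12, 2, 7, 1, 25]

curve = {k: top_k_fraction(toy_ranks, k) for k in (1, 5, 10)}
print(curve)
```

The value at k = 1 corresponds to the proportion of correct top-ranked identifications.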

We notice that the NB kernel is consistently more accurate than 
PPK. In addition, MKL clearly improves the identification 
performance; in particular, the number of top-ranked identifications 
increases significantly. A t-test between the ranks obtained with 
ALIGNF and with PPK gives a P-value of 0.06, which indicates that 
the improvement in identification by ALIGNF over PPK is 
significant. ALIGNF comes out on top of the MKL 
approaches, which is in line with its good fingerprint prediction 
accuracy and F1 score. 

The effect of the mass accuracy window during database 
retrieval is shown in Figure 2c and d. A narrower 20-ppm mass 
search window filters out many false candidates and thus 
significantly raises the identification accuracy, to 60% on the 
METLIN dataset and 40% on the MassBank dataset. However, 
the effect of improved molecular fingerprint prediction is 
softened because the candidates are fewer but possibly more similar. 
An extreme case is observed in Figure 2d, in which all 
methods converge to the same result when searching with a 20-ppm mass 
accuracy window. 
4 DISCUSSION 

The present work combines the combinatorial fragmentation tree 
approach with kernel-based machine learning. We suggest several 
kernels for fragmentation trees, and show how to fuse their 
information through MKL. The result significantly enhances 
molecular fingerprint prediction and metabolite identification. 

The closest analogs to our fragmentation tree kernels in the 
literature are those defined for parse trees in natural language 
processing (Collins and Duffy, 2001); our fragmentation trees can be 
seen as parses of the MS/MS spectra. DP techniques similar to 
ours are used there for computing kernels between trees (Collins 
and Duffy, 2001; Kuboyama, 2007). However, fragmentation 
tree kernels differ in important ways from kernels defined on 
parses of natural language and from kernels comparing molecular 
structures (Mahé and Vert, 2009). Unlike in natural language 
parses, the node labels are partially ordered (via their 
molecular weights) and the edges carry labels as well. Unlike in 
kernels for molecular graphs, the label spaces of both nodes and 
edges are vast (subsets of molecular formulae). 
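
To make the role of node and edge labels concrete, the following toy sketch compares two small labeled trees by counting the fragment formulae (node labels) and loss formulae (edge labels) they share. It is only an illustration in the spirit of label-counting kernels, not a definition of the kernels used in this work; the tree encoding, the function names and the example formulae are assumptions made for this example. Because the label spaces are vast, labels are stored as sparse sets rather than as fixed-length vectors.

```python
def node_labels(tree):
    """Set of fragment formulae (node labels) occurring in the tree."""
    return set(tree["nodes"])

def loss_labels(tree):
    """Set of loss formulae (edge labels) occurring in the tree."""
    return {loss for (_parent, _child, loss) in tree["edges"]}

def common_label_kernel(t1, t2):
    """Number of shared node labels plus number of shared edge labels.

    This equals the inner product of binary indicator vectors over all
    labels, so it is a valid (positive semi-definite) kernel.
    """
    shared_nodes = node_labels(t1) & node_labels(t2)
    shared_losses = loss_labels(t1) & loss_labels(t2)
    return len(shared_nodes) + len(shared_losses)

# Two hypothetical fragmentation trees; edges are (parent, child, loss)
tree_1 = {
    "nodes": ["C6H12O6", "C6H10O5", "C5H8O4"],
    "edges": [("C6H12O6", "C6H10O5", "H2O"), ("C6H10O5", "C5H8O4", "CH2O")],
}
tree_2 = {
    "nodes": ["C6H12O6", "C6H10O5", "C6H8O4"],
    "edges": [("C6H12O6", "C6H10O5", "H2O"), ("C6H10O5", "C6H8O4", "H2O")],
}

print(common_label_kernel(tree_1, tree_2))
```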

The comparison with the PPK employed by the FingerID 
(Heinonen et al., 2012) software shows that the fragmentation 
tree kernels are able to extract more information from the MS/MS 
spectra. Improvements are seen in both the prediction 
accuracy and the F1 score. Compared with FingerID (PPK), the 
uniform combination of the kernels (UNIMKL) improves the 
molecular fingerprint prediction significantly in accuracy and 
F1. As witnessed by many MKL applications, the UNIMKL 
algorithm is hard to beat. In our results, several MKL algorithms, 
such as ALIGNF and ℓ3-norm MKL, give slightly better results than 
UNIMKL. The improvements in molecular fingerprint prediction 
translate to improved metabolite identification. 
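
The difference between a uniform combination and an alignment-based weighting of kernels can be sketched as follows. This is a minimal illustration under stated assumptions: kernel matrices are plain nested lists, the function names are hypothetical, and the alignment score follows the centered kernel alignment of Cortes et al. (2012) as we understand it, with each kernel scored independently against a target kernel built from the labels. It is not the code used in our experiments.

```python
def center(K):
    """Center a kernel matrix in feature space (equivalent to H K H)."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + total for j in range(n)] for i in range(n)]

def frobenius(A, B):
    """Frobenius inner product of two matrices."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def centered_alignment(K, K_target):
    """Centered alignment between a kernel and a target kernel."""
    Kc, Tc = center(K), center(K_target)
    return frobenius(Kc, Tc) / (frobenius(Kc, Kc) ** 0.5 * frobenius(Tc, Tc) ** 0.5)

def combine(kernels, weights):
    """Weighted sum of kernel matrices."""
    n = len(kernels[0])
    return [[sum(w * K[i][j] for w, K in zip(weights, kernels)) for j in range(n)]
            for i in range(n)]

# Toy data: labels of one fingerprint property and two candidate kernels
y = [1, 1, -1, -1]
K_target = [[a * b for b in y] for a in y]
K_1 = [row[:] for row in K_target]                                   # ideal kernel
K_2 = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]  # identity

uniform = combine([K_1, K_2], [0.5, 0.5])
scores = [centered_alignment(K, K_target) for K in (K_1, K_2)]
weights = [s / sum(scores) for s in scores]
aligned = combine([K_1, K_2], weights)
print(scores, weights)
```

In this toy case, the kernel that matches the labels receives the larger weight, whereas the uniform combination treats both kernels equally.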

There are several possible routes forward with the current 
metabolite identification framework. First, post-processing of the 
candidate list, such as that proposed by Allen et al. (2013), is 
necessary when searching a large compound database such as 
PubChem, because the returned candidates (hundreds to 
thousands) may share the same fingerprints, and there is no way to 
distinguish them based on molecular fingerprints alone. Second, 
training a separate SVM for each fingerprint property is clearly an 
aspect that can be improved upon, for example, by a multi-label 
classification approach. A still more tempting yet challenging 
direction would be to replace the two-step identification by an 
integrated prediction approach. Such an approach would 
potentially learn to predict the fingerprint properties that are 
important for discriminating metabolites from each other. 
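
The tie problem mentioned above can be illustrated with a small sketch. Candidates are scored here by the number of fingerprint bits that agree with the predicted fingerprint; this simple agreement count, the function names and the toy fingerprints are assumptions made for this example and do not reproduce the scoring used by FingerID.

```python
from collections import defaultdict

def rank_candidates(predicted, candidates):
    """Sort candidates by the number of bits agreeing with the prediction."""
    scored = [
        (sum(p == b for p, b in zip(predicted, fingerprint)), name)
        for name, fingerprint in candidates.items()
    ]
    # Highest score first; names break ties only to make the order deterministic
    return sorted(scored, key=lambda item: (-item[0], item[1]))

def tied_groups(candidates):
    """Groups of candidates that share exactly the same fingerprint."""
    groups = defaultdict(list)
    for name, fingerprint in candidates.items():
        groups[tuple(fingerprint)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]

# Hypothetical predicted fingerprint and candidate fingerprints
predicted = [1, 0, 1, 1, 0]
toy_candidates = {
    "cand_A": [1, 0, 1, 1, 0],
    "cand_B": [1, 0, 1, 1, 0],
    "cand_C": [1, 1, 1, 0, 0],
    "cand_D": [0, 1, 0, 0, 1],
}

print(rank_candidates(predicted, toy_candidates))
print(tied_groups(toy_candidates))
```

Candidates with identical fingerprints necessarily receive identical scores, so any fingerprint-only ranking leaves their relative order undetermined.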

Funding: Academy of Finland grant 268874 (MIDAS); Deutsche 
Forschungsgemeinschaft grant (BO 1910/16-1) (IDUN). 

Conflict of Interest: none declared. 
REFERENCES 

Allen,F. et al. (2013) Competitive fragmentation modeling of ESI-MS/MS spectra 
for metabolite identification. Pre-print. arXiv:1312.0264. 

Böcker,S. and Rasche,F. (2008) Towards de novo identification of metabolites by 
analyzing tandem mass spectra. Bioinformatics, 24, i49-i55. 

Böcker,S. et al. (2009) Sirius: decomposing isotope patterns for metabolite 
identification. Bioinformatics, 25, 218-224. 

Collins,M. and Duffy,N. (2001) Convolution kernels for natural language. In 
Dietterich,T.G. et al. (eds) Advances in Neural Information Processing Systems 
14. MIT Press, Cambridge, MA, pp. 625-632. 

Cortes,C. et al. (2012) Algorithms for learning kernels based on centered alignment. 
J. Mach. Learn. Res., 13, 795-828. 

Demuth,W. et al. (2004) Spectral similarity versus structural similarity: mass 
spectrometry. Anal. Chim. Acta, 516, 75-85. 

Gerlich,M. and Neumann,S. (2013) MetFusion: integration of compound 
identification strategies. J. Mass Spectrom., 48, 291-298. 

Gönen,M. and Alpaydin,E. (2011) Multiple kernel learning algorithms. J. Mach. 
Learn. Res., 12, 2211-2268. 

Heinonen,M. et al. (2012) Metabolite identification and molecular fingerprint 
prediction through machine learning. Bioinformatics, 28, 2333-2341. 

Hill,D.W. et al. (2008) Mass spectral metabonomics beyond elemental formula: 
chemical database querying by matching experimental with computational 
fragmentation spectra. Anal. Chem., 80, 5574-5582. 

Hisayuki,H. et al. (2010) Massbank: a public repository for sharing mass spectral 
data for life sciences. J. Mass Spectrom., 45, 703-714. 

Hufsky,F. et al. (2014) Computational mass spectrometry for small molecule 
fragmentation. Trends Anal. Chem., 53, 41-48. 

Jebara,T. et al. (2004) Probability product kernels. J. Mach. Learn. Res., 5, 819-844. 

Kanehisa,M. and Goto,S. (2000) KEGG: Kyoto encyclopedia of genes and 
genomes. Nucleic Acids Res., 28, 27-30. 

Kangas,L.J. et al. (2012) In silico identification software (ISIS): a machine learning 
approach to tandem mass spectral identification of lipids. Bioinformatics, 28, 
1705-1713. 

Kind,T. and Fiehn,O. (2007) Seven golden rules for heuristic filtering of 
molecular formulas obtained by accurate mass spectrometry. BMC 
Bioinformatics, 8, 105. 

Kloft,M. et al. (2011) ℓp-norm multiple kernel learning. J. Mach. Learn. Res., 12, 
953-997. 

Kuboyama,T. (2007) Matching and learning in trees. PhD Thesis, University of 
Tokyo. 

Lanckriet,G. et al. (2002) Learning the kernel matrix with semi-definite 
programming. J. Mach. Learn. Res., 5, 2004. 

Li,J. and Sun,S. (2010) Nonlinear combination of multiple kernels for support 
vector machines. In International Conference on Pattern Recognition, Istanbul. 
IEEE, pp. 2889-2892. 

Mahé,P. and Vert,J.-P. (2009) Graph kernels based on tree patterns for molecules. 
Mach. Learn., 75, 3-35. 

Oberacher,H. et al. (2009) On the inter-instrument and the inter-laboratory 
transferability of a tandem mass spectral reference library: 2. optimization and 
characterization of the search algorithm. J. Mass Spectrom., 44, 494-502. 

O'Boyle,N. et al. (2011) Open babel: an open chemical toolbox. J. Cheminform., 
3, 33. 

Pitkänen,E. et al. (2010) Computational methods for metabolic reconstruction. 
Curr. Opin. Biotechnol., 21, 70-77. 

Rasche,F. et al. (2011) Computing fragmentation trees from tandem mass 
spectrometry data. Anal. Chem., 83, 1243-1251. 

Rasche,F. et al. (2012) Identifying the unknowns by aligning fragmentation trees. 
Anal. Chem., 84, 3417-3426. 

Rauf,I. et al. (2012) Finding maximum colorful subtrees in practice. In Benny,C. 
(ed.) Research in Computational Molecular Biology. Volume 7262 of Lecture 
Notes in Computer Science. Springer, Berlin Heidelberg, pp. 213-223. 

Rojas-Cherto,M. et al. (2012) Metabolite identification using automated 
comparison of high-resolution multistage mass spectral trees. Anal. Chem., 84, 
5524-5534. 

Scheubert,K. et al. (2013) Computational mass spectrometry for small molecules. 
J. Cheminform., 5, 12. 

Shen,H. et al. (2013) Metabolite identification through machine learning - tackling 
CASMI challenge using FingerID. Metabolites, 3, 484-505. 

Smith,C.A. et al. (2005) Metlin: a metabolite mass spectral database. Ther. Drug 
Monit., 27, 747-751. 

Tautenhahn,R. et al. (2012) An accelerated workflow for untargeted metabolomics 
using the METLIN database. Nat. Biotechnol., 30, 826-828. 

Wolf,S. et al. (2010) In silico fragmentation for computer assisted identification of 
metabolite mass spectra. BMC Bioinformatics, 11, 148. 






